Accelerating Growth and Size-dependent Distribution of Human Online Activities 
Research on human online activities usually assumes that total activity $T$ increases linearly with active population $P$, that is, $T \propto P^{\gamma}$ with $\gamma = 1$. However, we find examples of systems where total activity grows faster than active population. Our study shows that the power law relationship $T \propto P^{\gamma}$ ($\gamma > 1$) is in fact ubiquitous in online activities such as micro-blogging, news voting and photo tagging. We call the pattern "accelerating growth" and find that it relates to a type of distribution that changes with system size. We show both analytically and empirically how the growth rate $\gamma$ is associated with a scaling parameter $b$ in the size-dependent distribution. As most previous studies explain accelerating growth by a power law distribution, the model of size-dependent distribution is novel and worth further exploration.
I. INTRODUCTION

There are two ways of describing the growth of human online activities: linear and nonlinear. Linear models assume that the average level of individual activity is a constant. Hence, the total amount of activity $T$ is a linear function of the active population $P$, namely, $T \propto P^{\gamma}$ with $\gamma = 1$. For instance, the average degree is a constant parameter in the Barabasi-Albert (BA) model [1]. In nonlinear models, however, the average level of individual activity increases with system size [2], leading to a power law relationship $T \propto P^{\gamma}$ ($\gamma > 1$). This relationship is supported by empirical studies on different online activities including game playing [3], resource recommendation [4], collaborative programming [5] and tagging [6, 7], as well as off-line activities such as energy consumption [8-11] and wealth creation [12, 13].

Although there is plenty of evidence for nonlinear growth (referred to hereafter as "accelerating growth"), how such growth arises is still an open question. In the current study, we find a power law relationship $T \propto P^{\gamma}$ ($\gamma > 1$) in several types of online activities, ranging from micro-blogging and news voting to photo tagging. While previous authors have explained it with a long-tail distribution, we propose that it is the "size-dependent distribution", rather than a long-tail distribution, that gives rise to accelerating growth. We observe that the behavior of highly active users conforms to Zipf's law, while the behavior of less active users does not. Therefore Zipf's law, also known as the rank curve of a power law distribution, fails to capture a regularity in human online activities. Instead of Zipf's law, we use a discrete generalized beta distribution (DGBD) [17] to fit the curves. By tuning two parameters in the DGBD, $a$ (which determines the activities of highly active users and corresponds to the exponent in Zipf's law) and $b$ (which determines the activities of the less active users), we are able to fit the empirical curves with $R^{2} > 0.9$. Furthermore, we observe that the rank curves of individual activities change with population size, and that this correlation can be captured by adding a scaling factor $P^{b}$ to the DGBD function. We call the modified DGBD function the "size-dependent distribution" and derive the relationship $T \propto P^{\gamma}$ ($\gamma > 1$) from it analytically. We therefore find that the accelerating growth rate $\gamma$ is not determined by the exponent in Zipf's law (or a power law distribution), as claimed by previous studies, but is in fact related to the size-dependent exponent $b$.



II. ACCELERATING GROWTH IN HUMAN 
ONLINE ACTIVITIES 

Our data cover several typical types of human online activities, including the micro-blogging activities of 6,426 users on Jiwai, the news voting activities of 139,409 users on Digg, the photo tagging activities of 195,575 users on Flickr, and the book tagging activities of 13,988 users on Delicious. All four data sets are publicly available. The Jiwai data set is published in [20] and is available at http://www.fanpq.com/. The Digg data set is published in [18] and can be downloaded from
http://www.isi.edu/integration/people/lerman/downloads.h
The Flickr and Delicious data sets can be downloaded from
http://www.uni-koblenz-landau.de/koblenz/fb4/AGStaab/Res



' zhangjiang@bnu.edu.cn 



In the four systems, we define $P$ as the number of active users in a day, and $T$ as the total activity generated by these users. Note that the unit of $T$ differs across systems: $T$ counts micro-blogs in Jiwai, news votes in Digg, and tags in Delicious and Flickr. The systems are summarized in Table I.

TABLE I: Estimates of accelerating growth.
("N of Days" is the number of observations of the given dataset and "CI" stands for confidence interval.)

Activity        Dataset    $\gamma$  95% CI        Adjusted $R^2$  N of Days  URL
Photo tagging   Flickr     1.48      [1.43, 1.54]  0.93            193        flickr.com
Micro-blogging  Jiwai      1.19      [1.03, 1.48]  0.98            21         rn.jiwai.de
News voting     Digg       1.18      [1.06, 1.22]  1.00            31         digg.com/news
Book tagging    Delicious  1.17      [1.15, 1.19]  0.93            663        delicious.com

FIG. 1: (Color online) Accelerating growth in human online activities. Different datasets are marked by points of different colors and shapes (green squares for Flickr, orange diamonds for Delicious, red circles for Jiwai, and blue triangles for Digg). The x axis shows the active population in a day and the y axis shows the total activity in the day. Both axes have a logarithmic scale. The orthogonal regression lines are also shown.

We plot $T$ against $P$ on a log-log scale (Fig. 1) and find a power law relationship

$$T \propto P^{\gamma}. \tag{1}$$

The values of $\gamma$ estimated by orthogonal regression are shown in Table I. We use orthogonal regression instead of ordinary least squares regression because the latter tends to overstate the effect of outliers. The estimated values of $\gamma$ shown in Table I are all greater than 1. Because a greater $\gamma$ means that more activity is generated by a given population, we can regard $\gamma$ as an indicator of productivity and compare the productivity of online and off-line systems. Off-line activities, including wealth creation and patent invention, have been found to scale with population as a power law, with $\gamma$ estimated to be between 1 and 1.35 [9]. As one of the $\gamma$ values in our study exceeds 1.40 (Flickr), this suggests that online systems could be more productive than off-line systems.

Our analysis of four online systems provides varying estimates of $\gamma$. But what determines $\gamma$? In exploring the underlying distribution of individual activities, we discover a new type of distribution that changes with system size. In the following section, we introduce the size-dependent distribution, which we propose relates to accelerating growth in human online activities. In particular, we show how the exponent $b$ of the distribution can be used to predict the value of $\gamma$.



III. FROM SIZE DEPENDENT DISTRIBUTION
TO ACCELERATING GROWTH

In this section, we show that the DGBD, which has been extensively studied for various data sets in [17], fits the daily distributions of individual activities in the four online systems. We then show that the distribution changes with system size and that this correlation can be captured by a scaling factor $P^{b}$. In other words, by adding $P^{b}$ to the distribution function, we can use it to fit the empirical distributions on different days. We call the modified distribution the size-dependent distribution and show that accelerating growth in total activity can be derived from that distribution.


A. The DGBD model 

FIG. 2: (Color online) Three examples of daily distribution of individual activities: Flickr (a), Digg (b), Delicious (c), and Jiwai (d). Different colors and shapes of the data points indicate different days. The y axis shows the individual activities and the x axis shows the decreasing ranks of the activities. Both axes have a logarithmic scale. The rescaled form of the example distributions and the theoretical curves predicted by the size-dependent DGBD model are shown in semi-log plots (in which the y axis has a logarithmic scale) in the insets.

We use $t(r)$ to denote the activity of a user in one day, in which $r$ is the decreasing rank of the activity among all individual activities in the day. Thus the maximum value of $r$, $r_{\max}$, equals the population $P$. The DGBD model [17] of individual activities is then

$$t(r) = A (P + 1 - r)^{b} r^{-a} \quad (a > 0,\ b > 0), \tag{2}$$

where $A$, $b$ and $a$ are parameters to be estimated. In Fig. 2 we show three examples of daily distributions for each system. Note that if we set $b = 0$ then Eq. (2) becomes Zipf's law $t(r) \propto r^{-a}$. Therefore, the DGBD can be viewed as a generalized Zipf's law (or power law distribution). The reason for using the DGBD instead of Zipf's law, which has been widely used to fit human behavioral data exhibiting a long tail, is that in our data the rank curves in the rank-ordered plots (individual activity vs. rank) are not perfectly straight lines. The empirical curves deviate from the straight line predicted by Zipf's law at the right tails (Fig. 2), and hence estimations based on Zipf's law will be biased.
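The relation between the DGBD of Eq. (2) and Zipf's law can be illustrated with a minimal Python sketch. The function names `dgbd` and `zipf` and all numerical values below are hypothetical choices made for illustration; they are not estimates from the data sets in this paper.

```python
def dgbd(r, P, A, a, b):
    """DGBD rank curve of Eq. (2): t(r) = A * (P + 1 - r)**b * r**(-a)."""
    if not 1 <= r <= P:
        raise ValueError("rank r must satisfy 1 <= r <= P")
    return A * (P + 1 - r) ** b * r ** (-a)


def zipf(r, A, a):
    """Zipf's law t(r) = A * r**(-a), the b = 0 special case of the DGBD."""
    return A * r ** (-a)


# Illustrative values only (assumptions, not fitted parameters).
P, A, a = 1000, 50.0, 0.9

# With b = 0 the DGBD and Zipf curves coincide at every rank.
same_as_zipf = all(
    abs(dgbd(r, P, A, a, 0.0) - zipf(r, A, a)) < 1e-9 for r in range(1, P + 1)
)

# With b > 0 the right tail bends below the Zipf line: compare the activity of
# the least active user (r = P) relative to the most active one (r = 1).
ratio_zipf = zipf(P, A, a) / zipf(1, A, a)
ratio_dgbd = dgbd(P, P, A, a, 0.5) / dgbd(1, P, A, a, 0.5)
```

In this sketch `ratio_dgbd` equals $P^{-(a+b)}$ while `ratio_zipf` equals $P^{-a}$, which is one way to see why a straight Zipf line overestimates the activity of the least active users when $b > 0$.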



B. The size-dependent DGBD model 

TABLE II: Estimations of the size-dependent DGBD models.

Dataset    b     a     Adjusted $R^2$  N of Days
Flickr     0.54  0.97  0.97            193
Jiwai      0.04  0.90  0.96            21
Digg       0.06  0.85  0.94            31
Delicious  0.15  0.91  0.94            663

In fitting the daily distributions of individual activities with the DGBD, we find that the parameter $A$ changes with population size. To confirm this empirical finding, we analyze the DGBD function and find that the relationship between $A$ and $P$ can be derived as follows. As $r_{\max}$ equals $P$ and the minimum value of individual activities $t(P)$ is 1 by construction (because we only consider active users), we have the boundary condition

$$t(r_{\max}) = t(P) = 1. \tag{3}$$

Substituting Eq. (3) into Eq. (2) gives $A (P + 1 - P)^{b} P^{-a} = A P^{-a} = 1$, that is,

$$A = P^{a}. \tag{4}$$

Let $k = r/(P+1)$. Obviously $k \in (0, 1)$ and $1 - k \in (0, 1)$. Substituting $r = k(P+1)$ and Eq. (4) into Eq. (2), we get

$$t(k) = P^{a} (P+1)^{b-a} (1-k)^{b} k^{-a}. \tag{5}$$

As $P \gg 1$ in the data, namely $P + 1 \approx P$, we can rewrite Eq. (5) as

$$t(k) \approx P^{b} (1-k)^{b} k^{-a}, \tag{6}$$

or, in logarithmic form,

$$\ln t(k) \approx b \ln\big(P(1-k)\big) - a \ln k. \tag{7}$$
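As a rough check on the step from Eq. (5) to Eq. (6), the ratio of the two expressions can be written out explicitly. The population $P = 1000$ used below is a hypothetical value chosen for illustration; $a = 0.97$ and $b = 0.54$ are the Flickr estimates from Table II.

```latex
\frac{P^{a}(P+1)^{b-a}(1-k)^{b}k^{-a}}{P^{b}(1-k)^{b}k^{-a}}
  = \left(1 + \frac{1}{P}\right)^{b-a}
  \approx 1 + \frac{b-a}{P},
\qquad
\left(1 + \frac{1}{1000}\right)^{0.54 - 0.97} \approx 0.9996 .
```

The relative error is of order $|b-a|/P$ and does not depend on $k$, so under these assumptions the approximation only rescales the whole curve by a factor very close to 1.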



Eq. (6) controls for the variation in system size by replacing $A$ with a scaling factor $P^{b}$ and by normalizing the rank $r$ into $k$. We refer to Eq. (6) as the size-dependent DGBD model and estimate its parameters $a$ and $b$ by ordinary least squares regression (Table II). The large adjusted $R^{2}$ is evidence that the size-dependent DGBD model captures the dynamic properties of human online activities very well. Another way to validate the size-dependent DGBD model is to plot $t(k)/P^{b}$ against $k = r/(P + 1)$ for different days and check whether the relationship in Eq. (6) emerges from the data. The insets in Fig. 2 show that the empirical distributions on different days collapse onto the same theoretical curves predicted by Eq. (6); thus our deduction is empirically supported.
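One possible way to carry out the least squares estimation is to regress $\ln t$ on the two regressors of Eq. (7) without an intercept. The sketch below does this on synthetic data using only the Python standard library. The helper names, the "true" parameters, the noise level and the three population sizes are all assumptions made for illustration, and the exact estimation procedure behind Table II may differ.

```python
import math
import random


def size_dependent_dgbd(k, P, a, b):
    """Eq. (6): t(k) ~ P**b * (1 - k)**b * k**(-a), with k = r / (P + 1)."""
    return P ** b * (1 - k) ** b * k ** (-a)


def fit_a_b(samples):
    """OLS fit of Eq. (7), ln t = b * ln(P * (1 - k)) - a * ln k, with no intercept.

    `samples` is an iterable of (P, r, t) triples pooled over days.
    Returns the estimates (a, b) obtained from the 2x2 normal equations.
    """
    s11 = s12 = s22 = s1y = s2y = 0.0
    for P, r, t in samples:
        k = r / (P + 1)
        x1 = math.log(P * (1 - k))   # regressor multiplying b
        x2 = -math.log(k)            # regressor multiplying a
        y = math.log(t)
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        s1y += x1 * y
        s2y += x2 * y
    det = s11 * s22 - s12 * s12
    b_hat = (s1y * s22 - s2y * s12) / det
    a_hat = (s2y * s11 - s1y * s12) / det
    return a_hat, b_hat


# Synthetic "days" with different populations (hypothetical values).
random.seed(0)
TRUE_A, TRUE_B = 0.9, 0.3
samples = []
for P in (200, 500, 1000):
    for r in range(1, P + 1):
        k = r / (P + 1)
        noise = math.exp(random.gauss(0.0, 0.05))   # multiplicative noise
        samples.append((P, r, size_dependent_dgbd(k, P, TRUE_A, TRUE_B) * noise))

a_hat, b_hat = fit_a_b(samples)
```

Pooling days with different $P$ is what lets the regression separate the effect of system size from the shape of the rank curve; with a single day, $\ln P$ would be a constant absorbed into the first regressor.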

We can derive the probability density function $f(x)$ of individual activity from the size-dependent function Eq. (6) as follows. The cumulative function $C(x) = \Pr\{X > x\}$ of the activity is the inverse function of the rank-activity curve $t(k)$, namely,

$$C(x) = t^{-1}(x) = h\!\left(\frac{x}{P^{b}}\right), \tag{8}$$

where the function $h(x)$ is the inverse function of $(1-k)^{b} k^{-a}$, that is,

$$h^{-1}(k) = (1-k)^{b} k^{-a}. \tag{9}$$

The probability density function $f(x)$ can be derived from Eq. (8) as

$$f(x) = -\frac{dC(x)}{dx} = -\frac{1}{P^{b}}\, h'\!\left(\frac{x}{P^{b}}\right). \tag{10}$$

Setting $g(x) = -h'(x)$ in Eq. (10) gives a generalized form of the probability density function:

$$f(x) = \frac{1}{P^{b}}\, g\!\left(\frac{x}{P^{b}}\right), \tag{11}$$

which has already been found in off-line systems such as the stock market and the income distribution [13, 23, 24].



C. From the size-dependent DGBD model to 
accelerating growth 

Accelerating growth can be derived from the size-dependent DGBD model as follows. The integration of all user activities $t(r)$ is the total activity $T$, that is,

$$T = \int t(r)\,dr = (P+1) \int t(k)\,dk \approx P^{b+1} \int_{0}^{1} (1-k)^{b} k^{-a}\,dk. \tag{12}$$

Using the Euler integral of the first kind (the beta function, which converges here because $a < 1$ for all data sets in Table II), we can rewrite Eq. (12) as

$$T \approx P^{b+1}\, \frac{\Gamma(1-a)\,\Gamma(1+b)}{\Gamma(2-a+b)}, \tag{13}$$

where $\Gamma$ is the gamma function. As $b$ and $a$ are constants, we can replace $\Gamma(1-a)\Gamma(1+b)/\Gamma(2-a+b)$ with a constant $C$ and further rewrite Eq. (13) as

$$T \approx C P^{b+1}. \tag{14}$$

Eq. (14) is the accelerating growth relationship. By comparing Eq. (1) with Eq. (14), it is apparent that

$$\gamma \approx b + 1. \tag{15}$$

Note that the value of $\gamma$ relates only to $b$. As mentioned above, Eq. (2) becomes Zipf's law when $b = 0$. Therefore Zipf's law leads to $\gamma = 1$, namely, a linear increase of total activity with the growth of population. Moreover, as it is the size-dependent factor $P^{b}$ in Eq. (6) that leads to accelerating growth, any distribution independent of system size, including the power law distribution predicted by the BA model [1], cannot result in accelerating growth.
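The relationship $\gamma \approx b + 1$ can be checked numerically. The following sketch sums the DGBD rank curve (with $A = P^{a}$) over all ranks for several population sizes, and then estimates the slope of $\ln T$ against $\ln P$ with a closed-form orthogonal (total least squares) regression. The parameter values $a = b = 0.5$ and the population sizes are hypothetical; $a$ is chosen well below 1 so that the discrete sum is close to its large-$P$ limit at these sizes. The function names are illustrative, not taken from the paper.

```python
import math


def total_activity(P, a, b):
    """Sum t(r) = P**a * (P + 1 - r)**b * r**(-a) over ranks r = 1..P."""
    scale = P ** a   # A = P**a from the boundary condition t(P) = 1
    return scale * sum((P + 1 - r) ** b * r ** (-a) for r in range(1, P + 1))


def orthogonal_slope(xs, ys):
    """Slope of the orthogonal (total least squares) regression line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)


a, b = 0.5, 0.5   # illustrative values, not taken from Table II
populations = [1000, 3000, 10000, 30000, 100000]
log_P = [math.log(P) for P in populations]
log_T = [math.log(total_activity(P, a, b)) for P in populations]

# Empirical growth exponent, to be compared with b + 1.
gamma_hat = orthogonal_slope(log_P, log_T)

# Constant C of Eq. (14), and how close C * P**(b + 1) is to the exact sum.
C = math.gamma(1 - a) * math.gamma(1 + b) / math.gamma(2 - a + b)
ratio = total_activity(100000, a, b) / (C * 100000 ** (b + 1))
```

Setting `b = 0.0` in the same sketch should bring the fitted slope back towards 1, which corresponds to the linear growth implied by Zipf's law.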
TABLE III: The comparison between theoretical and empirical values of $\gamma$.

Dataset    $\gamma$ within 95% CI  $\gamma'$  b     a
Flickr     [1.43, 1.54]            1.54       0.54  0.97
Jiwai      [1.03, 1.48]            1.04       0.04  0.90
Digg       [1.06, 1.22]            1.06       0.06  0.85
Delicious  [1.15, 1.19]            1.15       0.15  0.91


As [13] reports the finding of the DGBD with a parameter $b > 0$ in various empirical data sets, it is reasonable to conjecture that accelerating growth exists widely, as we have shown that $\gamma = b + 1 > 1$.

To validate Eq. (15), we compare the theoretical and empirical values of $\gamma$. Table III shows $\gamma$ estimated from empirical data (from Table I) and $\gamma'$, the theoretical value of $\gamma$ predicted by $b$ (from Table II). The values of $\gamma$ and $\gamma'$ are consistent with each other: every $\gamma'$ falls within the 95% CI of $\gamma$. Therefore our analytical deduction of the relationship between the size-dependent distribution and accelerating growth is supported.
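The comparison can be reproduced directly from the reported numbers. The sketch below takes the 95% confidence intervals of $\gamma$ and the fitted values of $b$ as published in the tables, computes $\gamma' = b + 1$, and tests whether it lies in the interval. Exact decimal arithmetic is used as a design choice (an assumption of this sketch) so that values on an interval endpoint are not misclassified by floating-point rounding.

```python
from decimal import Decimal

# Values transcribed from the tables (95% CI of gamma, and fitted b).
tables = {
    "Flickr":    {"ci": ("1.43", "1.54"), "b": "0.54"},
    "Jiwai":     {"ci": ("1.03", "1.48"), "b": "0.04"},
    "Digg":      {"ci": ("1.06", "1.22"), "b": "0.06"},
    "Delicious": {"ci": ("1.15", "1.19"), "b": "0.15"},
}


def predicted_gamma(b):
    """Theoretical growth exponent gamma' = b + 1."""
    return Decimal(b) + 1


def within_ci(value, ci):
    """True if value lies inside the closed interval ci = (low, high)."""
    low, high = (Decimal(x) for x in ci)
    return low <= value <= high


results = {
    name: (predicted_gamma(row["b"]), within_ci(predicted_gamma(row["b"]), row["ci"]))
    for name, row in tables.items()
}
```

At the two-decimal precision reported, three of the four predicted values (Flickr, Digg and Delicious) coincide with an endpoint of the interval, so the check relies on the interval being treated as closed.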

It should be noted that the generalized form of the size-dependent probability density function, Eq. (11), is a sufficient condition for accelerating growth, meaning that if we replace $g(x)$ with other functions, we can still obtain the power law relationship between system size $P$ and total activity $T$ [13]. This finding is consistent with our previous study on income distributions of countries [13].

IV. CONCLUSIONS 

In this paper, we discuss accelerating growth in human online activities, that is, a power law relationship between total activity and active population with an exponent greater than 1. The power law relationship is found to be ubiquitous across different types of human online behavior. We show analytically how the size-dependent distribution relates to accelerating growth, and validate our deduction using several large data sets containing millions of human online activity records.

The major theoretical contribution of this paper is the finding that the size-dependent distribution relates to accelerating growth quantitatively. Although our study is based on human online activities, this quantitative relationship is not necessarily confined to an online context. The model may also be used to explain accelerating growth patterns in off-line social systems such as cities [9] and countries [12, 13].

Besides the theoretical contribution, the model of size-dependent distribution has potential applications, e.g., in web crawling and website management. For example, with historical data on individual activities, webmasters can estimate the value of $b$ and predict the accelerating growth rate $\gamma$ of total activity, which may help them plan web server capacity accordingly. Webmasters can also compare the values of $\gamma$ among websites with equivalent functions, leading to a theoretically informed approach to benchmarking.
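The capacity-planning use case can be sketched as a simple projection. Under $T \approx C P^{b+1}$, the constant $C$ cancels when total activity at two population sizes is compared, so only an estimate of $b$ and one observed pair $(P, T)$ are needed. The function name and the site figures below are hypothetical; $b = 0.54$ is borrowed from the Flickr estimate purely as an example.

```python
def projected_activity(T_now, P_now, P_future, b):
    """Project total activity under T ~ C * P**(b + 1); C cancels in the ratio."""
    if P_now <= 0 or P_future <= 0:
        raise ValueError("populations must be positive")
    gamma = b + 1
    return T_now * (P_future / P_now) ** gamma


# Hypothetical site: 10,000 active users generating 50,000 actions per day.
T_future = projected_activity(50_000, 10_000, 20_000, b=0.54)
growth_factor = T_future / 50_000   # factor by which total activity grows
```

In this example, doubling the active population multiplies the projected total activity by $2^{1.54}$, roughly 2.9, rather than by 2 as a linear model would suggest.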

It should be noted that our findings appear to contradict the conclusions of previous studies. For example, earlier work suggests that the exponent $\gamma$ in accelerating growth is determined by the exponent in Zipf's law, but our analysis suggests that a power law distribution that is independent of system size will not lead to accelerating growth. We conjecture that this contradiction may be due to an unknown relationship between the parameters $a$ (which, as mentioned, corresponds to the exponent in Zipf's law) and $b$ in the DGBD model, since we have shown that $\gamma = b + 1$. The unknown relationship between $a$ and $b$, together with other unsolved problems such as the behavioral origins of size-dependent distributions, calls for further exploration.